import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix, make_scorer
warnings.filterwarnings("ignore")
There are 3 datasets provided for this project: Sample Solution, Testing, and Training data. I will first import all the data.
# Read in data
bbc_news_samp_solution = pd.read_csv('data/BBC_News_Sample_Solution.csv')
The solution dataset shows what the submission format should look like.
bbc_news_test = pd.read_csv('data/BBC_News_Test.csv')
bbc_news_train = pd.read_csv('data/BBC_News_Train.csv')
In this section, I will share visualizations and describe data cleaning procedures.
#view solution dataset
bbc_news_samp_solution.head()
|   | ArticleId | Category |
|---|---|---|
| 0 | 1018 | sport |
| 1 | 1319 | tech |
| 2 | 1138 | business |
| 3 | 459 | entertainment |
| 4 | 1020 | politics |
bbc_news_samp_solution.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  ------
 0   ArticleId  735 non-null    int64
 1   Category   735 non-null    object
dtypes: int64(1), object(1)
memory usage: 11.6+ KB
The solution dataset has 735 rows (all non-null) and 2 columns: ArticleId, a unique identifier for each article, and Category, the article's category label.
bbc_news_test.head()
|   | ArticleId | Text |
|---|---|---|
| 0 | 1018 | qpr keeper day heads for preston queens park r... |
| 1 | 1319 | software watching while you work software that... |
| 2 | 1138 | d arcy injury adds to ireland woe gordon d arc... |
| 3 | 459 | india s reliance family feud heats up the ongo... |
| 4 | 1020 | boro suffer morrison injury blow middlesbrough... |
bbc_news_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  ------
 0   ArticleId  735 non-null    int64
 1   Text       735 non-null    object
dtypes: int64(1), object(1)
memory usage: 11.6+ KB
The test dataset also has 735 rows and 2 columns: ArticleId and Text, which holds the full text of each article.
bbc_news_train.head()
|   | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business |
bbc_news_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  ------
 0   ArticleId  1490 non-null   int64
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.0+ KB
The training dataset has 1490 rows and 3 columns: ArticleId, Text, and Category. Unlike the test set, it includes the Category label for each article, which is what the models will learn to predict.
Check for empty string values in columns:
bbc_news_train[bbc_news_train['Text'] == '']
| ArticleId | Text | Category |
|---|---|---|
bbc_news_train[bbc_news_train['Category'] == '']
| ArticleId | Text | Category |
|---|---|---|
Neither query returns any rows, so there are no empty strings; combined with the .info() output above, the data has no missing values.
I will first view the distribution of how many articles are in each category.
fig = px.histogram(bbc_news_train, x="Category", title='Number of Articles for each Category - training data')
fig.show()
From the plot above, it can be seen that sport and business have the most articles, while tech has the fewest.
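As a numeric complement to the histogram, the per-category counts can be read off directly with value_counts. A minimal sketch on a toy frame (standing in for bbc_news_train):

```python
import pandas as pd

# Toy stand-in for bbc_news_train
df = pd.DataFrame({"Category": ["sport", "sport", "business", "tech"]})
counts = df["Category"].value_counts()
print(counts)  # sport appears twice, the others once
```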
Next, I want to visualize the length of the text column and understand what the distribution is.
bbc_news_train['Text_len'] = bbc_news_train['Text'].apply(lambda text: len(text.split()))
fig = px.histogram(bbc_news_train, x="Text_len", nbins=100, title="Text Length Histogram")
fig.show()
From the plot above, the distribution of text lengths is roughly bell-shaped but right-skewed: a small number of long articles form a tail to the right.
Next, I will clean the text data and then process it using TF-IDF.
# View text data
bbc_news_train['Text'][0]
'worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness. cynthia cooper worldcom s ex-head of internal accounting alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (£5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy. prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper who now runs her own consulting business told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a green light to the procedures and practices used by worldcom. mr ebber s lawyers have said he was unaware of the fraud arguing that auditors did not alert him to any problems. ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance chief giving only brief answers himself. the prosecution s star witness former worldcom financial chief scott sullivan has said that mr ebbers ordered accounting adjustments at the firm telling him to hit our books . however ms cooper said mr sullivan had not mentioned anything uncomfortable about worldcom s accounting during a 2001 audit committee meeting. mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing. worldcom emerged from bankruptcy protection in 2004 and is now known as mci. last week mci agreed to a buyout by verizon communications in a deal valued at $6.75bn.'
In the example above, there is a lot of punctuation, so I will remove it with a regular expression.
# get rid of punctuation
bbc_news_train['Text'] = bbc_news_train['Text'].str.replace(r'[^\w\s]', '', regex=True)
bbc_news_test['Text'] = bbc_news_test['Text'].str.replace(r'[^\w\s]', '', regex=True)
bbc_news_train['Text'][0]
'worldcom exboss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness cynthia cooper worldcom s exhead of internal accounting alerted directors to irregular accounting practices at the us telecoms giant in 2002 her warnings led to the collapse of the firm following the discovery of an 11bn 57bn accounting fraud mr ebbers has pleaded not guilty to charges of fraud and conspiracy prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates but ms cooper who now runs her own consulting business told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002 she said andersen had given a green light to the procedures and practices used by worldcom mr ebber s lawyers have said he was unaware of the fraud arguing that auditors did not alert him to any problems ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance chief giving only brief answers himself the prosecution s star witness former worldcom financial chief scott sullivan has said that mr ebbers ordered accounting adjustments at the firm telling him to hit our books however ms cooper said mr sullivan had not mentioned anything uncomfortable about worldcom s accounting during a 2001 audit committee meeting mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing worldcom emerged from bankruptcy protection in 2004 and is now known as mci last week mci agreed to a buyout by verizon communications in a deal valued at 675bn'
As can be seen above, with punctuation removed the text is easier to tokenize and model consistently.
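For reference, the same pattern applied to a short made-up string (the [^\w\s] character class matches anything that is neither a word character nor whitespace):

```python
import re

s = "mr ebbers' $11bn (£5.7bn) fraud!"
cleaned = re.sub(r"[^\w\s]", "", s)
print(cleaned)  # mr ebbers 11bn 57bn fraud
```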
Next, I will create a function to remove stopwords:
# import the nltk stopwords
nltk.download('stopwords')
stemmer = nltk.stem.PorterStemmer()
ENGLISH_STOP_WORDS = stopwords.words('english')
# tokenizer function to get rid of stop words
def my_tokenizer(sentence):
    listofwords = sentence.split(' ')
    listofstemmed_words = []
    for word in listofwords:
        # keep only non-empty, non-stop-word tokens, stemmed
        if (word not in ENGLISH_STOP_WORDS) and (word != ''):
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)
    return listofstemmed_words
[nltk_data] Downloading package stopwords to [nltk_data] /Users/puneetsran/nltk_data... [nltk_data] Package stopwords is already up-to-date!
# plot function to view most frequent words
def plot_most_frequent(words, word_counts, top=20):
    words_df = pd.DataFrame({"token": words,
                             "count": word_counts})
    fig, ax = plt.subplots(figsize=(0.75 * top, 5))
    words_df.sort_values(by="count", ascending=False).head(top)\
        .set_index("token")\
        .plot(kind="bar", rot=45, ax=ax)
    sns.despine()
    plt.title("Most frequent tokens")
    plt.show()
I will use TF-IDF (term frequency-inverse document frequency) to process the text. It is widely used for text mining and classification and, like bag-of-words, it produces a document-term matrix; the difference is that raw counts are reweighted by how informative each word is across the corpus. Concretely, tf-idf(t, d) = tf(t, d) × idf(t), where tf(t, d) is the count of term t in document d and idf(t) grows as t appears in fewer documents (scikit-learn's smoothed form is idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the total number of documents and df(t) the number containing t).
# using custom tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=5, tokenizer=my_tokenizer)
tfidf.fit(bbc_news_train['Text'])
bbc_news_train_transformed = tfidf.transform(bbc_news_train['Text'])
bbc_news_test_transformed = tfidf.transform(bbc_news_test['Text'])
print(bbc_news_train_transformed.shape)
print(bbc_news_test_transformed.shape)
(1490, 5425) (735, 5425)
words = tfidf.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn versions
word_weights = bbc_news_train_transformed.toarray().sum(axis=0)
plot_most_frequent(words, word_weights)
I will be using Non-Negative Matrix Factorization (NMF) to compress the TF-IDF features into 5 latent topic components before classifying.
tfidf_vec = TfidfVectorizer()
bbc_train_tfidf = tfidf_vec.fit_transform(bbc_news_train['Text'])
bbc_test_tfidf = tfidf_vec.transform(bbc_news_test['Text'])
nmf_model = NMF(n_components = 5)
train_nmf = nmf_model.fit_transform(bbc_train_tfidf)
test_nmf = nmf_model.transform(bbc_test_tfidf)
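For intuition, NMF factorizes the non-negative TF-IDF matrix X ≈ W·H, where W holds per-document topic weights and H holds per-topic term weights. A minimal sketch on a made-up four-document corpus (not the BBC data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "stocks markets shares profits", "profits shares economy markets",
    "match goal striker football", "football match league goal",
]
X = TfidfVectorizer().fit_transform(docs)
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)   # (n_docs, n_topics) document-topic weights
H = nmf.components_        # (n_topics, n_terms) topic-term weights
print(W.shape, H.shape)
```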
# logistic regression
logreg = LogisticRegression()
logreg.fit(train_nmf, bbc_news_train['Category'])
#prediction
train_prediction = logreg.predict(train_nmf)
test_prediction = logreg.predict(test_nmf)
train_accuracy = accuracy_score(bbc_news_train['Category'], train_prediction)
print("Train accuracy score:", train_accuracy)
Train accuracy score: 0.8832214765100671
test_df = pd.DataFrame(columns=['ArticleId', 'Category'])
test_df['ArticleId'] = bbc_news_test['ArticleId']
test_df['Category'] = test_prediction
display(test_df.head())
test_df.to_csv("./submission1.csv", index=False)
|   | ArticleId | Category |
|---|---|---|
| 0 | 1018 | sport |
| 1 | 1319 | tech |
| 2 | 1138 | sport |
| 3 | 459 | business |
| 4 | 1020 | sport |
actual_labels = bbc_news_samp_solution['Category']
test_accuracy = accuracy_score(actual_labels, test_prediction)
print("Test accuracy score:", test_accuracy)
Test accuracy score: 0.1891156462585034
4) Change hyperparameter(s) and record the results. We recommend including a summary table and/or graphs.
5) Improve the model performance if you can - some ideas may include, but are not limited to: using different feature extraction methods, fitting models on different subsets of the data, ensembling model predictions, etc.
train_scores = []
C_range = np.array([1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 0.1,
                    1, 10, 100, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9])
for c in C_range:
    my_logreg = LogisticRegression(C=c, random_state=1)
    my_logreg.fit(train_nmf, bbc_news_train['Category'])
    # score on the training set
    train_scores.append(my_logreg.score(train_nmf, bbc_news_train['Category']))
plt.figure()
plt.plot(C_range, train_scores,label="Train Score",marker='.')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.legend()
plt.show();
Based on this plot, training accuracy plateaus by around C = 1000, so I will use that value going forward. Note that this curve is computed on the training set only, so it indicates goodness of fit rather than generalization.
def tune_nmf_hyperparameters(data_frame):
    # note: despite the name, this pipeline has no NMF step -
    # it tunes a TF-IDF + logistic regression pipeline
    model_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', LogisticRegression())
    ])
    parameter_grid = {
        'tfidf__norm': ['l1', 'l2'],
        'tfidf__max_df': [0.95],
        'tfidf__min_df': [1, 2],
        'classifier__C': [1, 10, 100, 1000],
        # 'classifier__penalty': ['l1', 'l2'],
        # 'classifier__fit_intercept': [True, False],
        # 'classifier__class_weight': [None, 'balanced'],
        # 'classifier__solver': ['liblinear', 'lbfgs'],
        # 'classifier__max_iter': [100, 200],
    }
    accuracy_scorer = make_scorer(accuracy_score)
    grid_search_cv = GridSearchCV(model_pipeline, parameter_grid, cv=5, scoring=accuracy_scorer)
    grid_search_cv.fit(data_frame['Text'], data_frame['Category'])
    return (grid_search_cv.best_estimator_, grid_search_cv.best_params_)
(best_model, parameters) = tune_nmf_hyperparameters(bbc_news_train)
parameters = {param_key: [param_value] for param_key, param_value in parameters.items()}
pd.DataFrame(parameters)
|   | classifier__C | tfidf__max_df | tfidf__min_df | tfidf__norm |
|---|---|---|---|---|
| 0 | 1000 | 0.95 | 1 | l2 |
GridSearchCV refits the best estimator on the full training set with the parameters above, so we can evaluate it directly:
train_prediction = best_model.predict(bbc_news_train['Text'])
test_prediction = best_model.predict(bbc_news_test['Text'])
train_accuracy = accuracy_score(bbc_news_train['Category'], train_prediction)
print("Train accuracy score:", train_accuracy)
Train accuracy score: 1.0
With hyperparameter optimization, the training accuracy has risen to 1.0. This is not necessarily an improvement: perfect training accuracy is a classic sign of overfitting, so cross-validated or test performance is a better guide.
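A less biased check than training accuracy is the cross-validated score on held-out folds (grid_search_cv.best_score_ above already reports exactly this). A generic sketch on a bundled toy dataset, not the BBC data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold cross-validation: each fold is scored on data the model never saw
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```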
test_df = pd.DataFrame(columns=['ArticleId', 'Category'])
test_df['ArticleId'] = bbc_news_test['ArticleId']
test_df['Category'] = test_prediction
display(test_df.head())
test_df.to_csv("./submission2.csv", index=False)
|   | ArticleId | Category |
|---|---|---|
| 0 | 1018 | sport |
| 1 | 1319 | tech |
| 2 | 1138 | sport |
| 3 | 459 | business |
| 4 | 1020 | sport |
actual_labels = bbc_news_samp_solution['Category']
test_accuracy = accuracy_score(actual_labels, test_prediction)
print("Test accuracy score:", test_accuracy)
Test accuracy score: 0.1891156462585034
Test accuracy has stayed the same. The score sits near the 20% chance level for five classes, which suggests the sample solution may contain placeholder labels rather than the true categories; if so, these "test accuracy" numbers should be read with caution.
1) Pick and train a supervised learning method(s) and compare the results (train and test performance)
For my supervised learning model, I chose KNeighborsClassifier: it is a natural fit for classification and simple to implement.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(bbc_news_train_transformed, bbc_news_train['Category'])
KNeighborsClassifier()
train_prediction = knn.predict(bbc_news_train_transformed)
train_accuracy = accuracy_score(bbc_news_train['Category'], train_prediction)
print("Train accuracy score:", train_accuracy)
Train accuracy score: 0.959731543624161
test_prediction = knn.predict(bbc_news_test_transformed)
actual_labels = bbc_news_samp_solution['Category']
test_accuracy = accuracy_score(actual_labels, test_prediction)
print("Test accuracy score:", test_accuracy)
Test accuracy score: 0.1836734693877551
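As a reminder of the mechanics, KNN predicts by majority vote among the k nearest training points; a minimal sketch on made-up 2-D points (not the TF-IDF features):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # ['a' 'b']
```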
2) Discuss comparison with the unsupervised approach. You may try changing the train data size (e.g., Include only 10%, 20%, 50% of labels, and observe train/test performance changes). Which methods are data-efficient (require a smaller amount of data to achieve similar results)? What about overfitting?
bbc_news_train_50 = bbc_news_train.sample(frac=0.5, random_state=42)
bbc_news_train_50.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 745 entries, 941 to 1253
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  ------
 0   ArticleId  745 non-null    int64
 1   Text       745 non-null    object
 2   Category   745 non-null    object
 3   Text_len   745 non-null    int64
dtypes: int64(2), object(2)
memory usage: 29.1+ KB
tfidf = TfidfVectorizer(min_df=5, tokenizer=my_tokenizer)
tfidf.fit(bbc_news_train_50['Text'])
bbc_news_train_50_transformed = tfidf.transform(bbc_news_train_50['Text'])
# transform the full test set (only the training labels are subsampled)
bbc_news_test_50_transformed = tfidf.transform(bbc_news_test['Text'])
knn = KNeighborsClassifier()
knn.fit(bbc_news_train_50_transformed, bbc_news_train_50['Category'])
train_prediction = knn.predict(bbc_news_train_50_transformed)
test_prediction = knn.predict(bbc_news_test_50_transformed)
train_accuracy = accuracy_score(bbc_news_train_50['Category'], train_prediction)
print("Train accuracy score:", train_accuracy)
Train accuracy score: 0.9516778523489933
actual_labels = bbc_news_samp_solution['Category']
test_df = pd.DataFrame(columns=['ArticleId', 'Category'])
test_df['ArticleId'] = bbc_news_test['ArticleId']
test_df['Category'] = test_prediction
display(test_df.head())
test_df.to_csv("./submission3.csv", index=False)
|   | ArticleId | Category |
|---|---|---|
| 0 | 1018 | politics |
| 1 | 1319 | tech |
| 2 | 1138 | business |
| 3 | 459 | sport |
| 4 | 1020 | tech |
test_accuracy = accuracy_score(actual_labels, test_prediction)
print("Test accuracy score:", test_accuracy)
Test accuracy score: 0.21496598639455783
For this question, I only tried 50% of the training data. Test accuracy held up (and even improved slightly) while training accuracy barely dropped, so the supervised KNN model is reasonably data-efficient and does not appear to be overfitting badly. Compared with the unsupervised NMF approach, the supervised model is also more direct: rather than first discovering latent structure in the text and then classifying, it learns the category boundaries from the labels.